OpenStreetMap Project - Data Wrangling with MongoDB

Fernando Hernandez


Data used - MapZen Weekly OpenStreetMaps Metro Extracts

Data Cleaning Source Code: ipython notebook, github

My Nanodegree Portal: fch808.github.io
All Data Wrangling Exercises: ipython notebook

Map Areas:



**1. Problems Encountered**

After loading the smallest city, Honolulu, three main problems with the data showed up:

  • Street names with different types of abbreviations (e.g. 'Clark Ave SE' or 'Eubank Northeast Ste E-18')
  • Inconsistent postal codes ('HI96826', '96826-0007', '89197')
  • Multiple cities needed to be accessed from one database

Address abbreviations

The initial cleaning script compared street names before and after address cleaning. This check surfaced some new abbreviations which needed to be accounted for, such as directions (S, E, SE, etc.) and location-specific suffixes (Ste, Lp, etc.).
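For reference, a minimal sketch of what such an update_name function could look like (the mapping below is a small hypothetical subset, not the project's full mapping):

```python
# Hypothetical subset of the abbreviation mapping; the project's
# actual mapping covers more street types and directions.
mapping = {"Ave": "Avenue", "St": "Street", "St.": "Street",
           "Dr": "Drive", "Blvd": "Boulevard", "Ste": "Suite",
           "NW": "Northwest", "NE": "Northeast",
           "SW": "Southwest", "SE": "Southeast"}

def update_name(name, mapping):
    """Expand each abbreviated word of a street name via the mapping."""
    return " ".join(mapping.get(word, word) for word in name.split())
```

For example, `update_name("Clark Ave SE", mapping)` returns `'Clark Avenue Southeast'`.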


In [ ]:
for st_type, ways in abq_st_types.iteritems():
    for name in ways:
        better_name = update_name(name, mapping)
        if name != better_name:
            print name, "=>", better_name

In [ ]:
Honolulu:
Kalakaua Ave => Kalakaua Avenue
Lusitania St. => Lusitania Street
...

Albuquerque:
Valley View Dr NW => Valley View Drive Northwest
Eubank Northeast Ste E-18 => Eubank Northeast Suite E-18
...


Zip codes

Initially, some cleanup was needed for the zip codes. This included:

  1. Removing the 4-digit postcode suffix (e.g. '96826-0007' → '96826').
  2. Removing state letters from the postcode (e.g. 'HI96826' → '96826').
  3. Converting to int (not strictly required).

After adding a zip code cleaning function, all zip codes adhered to a 5-digit format, but a few (such as 89197 and 87100) with only one entry each may have been entered incorrectly, since they do not match these cities' expected ranges.
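A minimal sketch of a cleaning function covering all three steps (the name update_postcode is an assumption; the original implementation is not shown here):

```python
import re

def update_postcode(postcode):
    """Reduce a raw postcode string to a 5-digit int, or None if impossible.

    Handles both suffixes ('96826-0007') and state letters ('HI96826')
    by keeping only the first run of exactly 5 digits.
    """
    match = re.search(r'\d{5}', postcode)
    return int(match.group()) if match else None
```

For example, `update_postcode('HI96826')` and `update_postcode('96826-0007')` both return `96826`.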


In [ ]:
def map_aggregate(db, collection, pipeline):
    db_collection = db[collection]
    result = db_collection.aggregate(pipeline)
    return result

# This function will be reused in further MongoDB explorations.
# Only the pipeline will need to be updated.
def make_city_pipeline(city):
    pipeline = [{"$match":{"address.postcode":{"$exists":1},
                                          "city_name":city}},   
                 {"$group": {"_id": {"City":"$city_name",
                                     "Zip":"$address.postcode"},
                            "count": {"$sum": 1}}},
                 {"$project": {'_id':0,
                               "City":"$_id.City",
                               "Zipcode":"$_id.Zip",
                               "Count":"$count"}},
                 {"$sort": {"Count": -1}},
                 {"$limit" : 10 }]
    return pipeline

pipeline = make_city_pipeline('honolulu')
result1 = map_aggregate(db, 'cities', pipeline)
pipeline = make_city_pipeline('albuquerque')
result2 = map_aggregate(db, 'cities', pipeline)

In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 219, u'Zipcode': 96815
              ...} 
 u'result': [{u'City': u'albuquerque', u'Count': 476, u'Zipcode': 87123
              ...}]}

Combining multiple cities

Two cities were chosen for this project. One, Albuquerque, is my birth city; the other, Honolulu, is where I have lived for the last 7 years. It seemed interesting to compare the different distributions present in each city. This also posed a small problem: the initial MongoDB loading script had to be modified to accommodate multiple cities.

The solution was to add a 'city_name' field, passed in with the JSON data when loading, so that this key-value pair could easily be added to each record. We can also make use of projections to clean up what's returned, since we have to match cities first and then group by multiple conditions.
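The tagging step can be sketched as follows (tag_records is an illustrative helper, not the original loading script):

```python
def tag_records(records, city_name):
    """Attach a 'city_name' field to each parsed record before insertion,
    so later pipelines can $match and $group on it."""
    for record in records:
        record['city_name'] = city_name
    return records

# During loading, each batch of parsed JSON records would be tagged, e.g.:
# db.cities.insert(tag_records(parsed_records, 'honolulu'))
```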


**2. Data Overview**

First, we can see some statistics for our dataset through MongoDB pymongo queries.

File Sizes


In [ ]:
albuquerque_new-mexico.osm - 53.8 MB
albuquerque_new-mexico.osm.json - 65.7 MB
honolulu_hawaii.osm - 43.8 MB
honolulu_hawaii.osm.json - 54.4 MB
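A listing like the one above can be produced with os.path.getsize plus a small formatting helper (human_mb is a hypothetical name):

```python
import os

def human_mb(nbytes):
    """Format a byte count in megabytes, matching the listing above."""
    return "%.1f MB" % (nbytes / 1.0e6)

# for fname in ("honolulu_hawaii.osm", "honolulu_hawaii.osm.json"):
#     print fname, "-", human_mb(os.path.getsize(fname))
```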

Number of documents


In [6]:
print "Both Cities:", db.cities.find().count() 
print "Honolulu:", db.cities.find({'city_name':'honolulu'}).count()
print "Albuquerque:", db.cities.find({'city_name':'albuquerque'}).count()

In [1]:
Both Cities: 489274
Honolulu: 227752
Albuquerque: 261522

Number of nodes.


In [ ]:
print "Both Cities:", db.cities.find({'type':'node'}).count()
print "Honolulu:", db.cities.find({'type':'node', 'city_name':'honolulu'}).count()
print "Albuquerque:", db.cities.find({'type':'node','city_name':'albuquerque'}).count()

In [1]:
Both Cities: 432587
Honolulu: 206262
Albuquerque: 226325

Number of ways.


In [1]:
print "Both Cities:", db.cities.find({'type':'way'}).count()
print "Honolulu:", db.cities.find({'type':'way', 'city_name':'honolulu'}).count()
print "Albuquerque:", db.cities.find({'type':'way', 'city_name':'albuquerque'}).count()

In [1]:
Both Cities: 56648
Honolulu: 21459
Albuquerque: 35189

Total number of contributors.


In [1]:
print "Contributors:", len(db.cities.distinct("created.user"))

In [1]:
Contributors: 611

**3. Additional Data Exploration using MongoDB**

Since we are reusing the make_city_pipeline() function defined earlier for the zip codes, only the pipeline passed in is shown for each new MongoDB pymongo query. Everything else remains the same.


Parking is by far the most reported amenity in either city. It seems that finding places to park our automobiles is most important, even on small islands.


In [ ]:
pipeline = [{"$match":{"amenity":{"$exists":1}, "city_name":city}},
            {"$group": {"_id": {"City":"$city_name", "Amenity":"$amenity"},
                        "count": {"$sum": 1}}},
            {"$project": {'_id':0,"City":"$_id.City",
                          "Amenity":"$_id.Amenity", "Count":"$count"}},
            {"$sort": {"Count": -1}},
            {"$limit" : 5 }]

In [ ]:
u'result': [{u'Amenity': u'parking', u'City': u'honolulu', u'Count': 280},
            {u'Amenity': u'restaurant', u'City': u'honolulu', u'Count': 123},..}
u'result': [{u'Amenity': u'parking', u'City': u'albuquerque',u'Count': 1270},
             {u'Amenity': u'school', u'City': u'albuquerque', u'Count': 258},..}

The top contributor for each city made up the lion's share of the edits. Interestingly, woodpeck_fixbot showed up in both cities, taking the #3 spot in Honolulu and the #5 spot in Albuquerque.
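As a rough check on that "lion's share" claim, the contributor counts below set against the city totals from section 2 give each top contributor's share:

```python
# Numbers quoted elsewhere in this document.
honolulu_total = 227752       # Honolulu documents (section 2)
albuquerque_total = 261522    # Albuquerque documents (section 2)

tom_share = 100.0 * 98401 / honolulu_total      # Tom_Holland's edits
ed_share = 100.0 * 88041 / albuquerque_total    # EdHillsman's edits
print("Tom_Holland: %.1f%%, EdHillsman: %.1f%%" % (tom_share, ed_share))
# roughly 43% and 34% of each city's documents, respectively
```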


In [ ]:
pipeline = [{"$match":{"created.user":{"$exists":1},
                       "city_name":city}},
            {"$group": {"_id": {"City":"$city_name", "User":"$created.user"},
                        "count": {"$sum": 1}}}, 
            {"$project": {'_id':0, "City":"$_id.City",
                          "User":"$_id.User", "Count":"$count"}},
            {"$sort": {"Count": -1}},
            {"$limit" : 5 }]

In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 98401, u'User': u'Tom_Holland'},
             {u'City': u'honolulu', u'Count': 13051, u'User': u'ikiya'},...}
 u'result': [{u'City': u'albuquerque', u'Count': 88041, u'User': u'EdHillsman'},
             {u'City': u'albuquerque', u'Count': 37604, u'User': u'anjbe'},...}

Religion is bigger and more one-sided in the Southwest. Albuquerque is heavily Christian, with plentiful churches. Honolulu reports far fewer places of worship, and Buddhism makes up a large relative share of those entered.


In [ ]:
pipeline = [{"$match":{"amenity":"place_of_worship", "city_name":city}},
            {"$group":{"_id": {"City":"$city_name", "Religion":"$religion"},
                       "count":{"$sum":1}}},
            {"$project":{"_id":0, "City":"$_id.City",
                         "Religion":"$_id.Religion", "Count":"$count"}},
            {"$sort":{"Count":-1}},
            {"$limit":6}]

In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 14, u'Religion': u'christian'},
             {u'City': u'honolulu', u'Count': 5, u'Religion': u'buddhist'},..}
 u'result': [{u'City': u'albuquerque', u'Count': 186,u'Religion': u'christian'},
             {u'City': u'albuquerque', u'Count': 36, u'Religion': 'NoNameGiven'},..}

For restaurants, pizza is most popular in the isles; regional Mexican in the Southwest.


In [ ]:
pipeline = [{"$match":{"amenity":"restaurant", "city_name":city}},
            {"$group":{"_id":{"City":"$city_name","Food":"$cuisine"},
                       "count":{"$sum":1}}},
            {"$project":{"_id":0, "City":"$_id.City",
                         "Food":"$_id.Food", "Count":"$count"}},
            {"$sort":{"Count":-1}}, 
            {"$limit":6}]

In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 7, u'Food': u'pizza'},
             {u'City': u'honolulu', u'Count': 3, u'Food': u'regional'},
             {u'City': u'honolulu', u'Count': 3, u'Food': u'japanese'}...}
 u'result': [{u'City': u'albuquerque', u'Count': 16, u'Food': u'mexican'},
             {u'City': u'albuquerque', u'Count': 10, u'Food': u'pizza'},
             {u'City': u'albuquerque', u'Count': 6, u'Food': u'american'}...}

Far-away cities show similar types of fast food entrenchment until the third choice. Next, let's see who these chains are.


In [ ]:
pipeline = [{"$match":{"amenity":"fast_food", "city_name":city}},
            {"$group":{"_id":{"City":"$city_name", "Food":"$cuisine"},
                       "count":{"$sum":1}}},
            {"$project":{"_id":0, "City":"$_id.City",
                         "Food":"$_id.Food",
                         "Count":"$count"}},
            {"$sort":{"Count":-1}}, 
            {"$limit":6}]

In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 13, u'Food': u'burger'},
             {u'City': u'honolulu', u'Count': 2, u'Food': u'sandwich'},
             {u'City': u'honolulu', u'Count': 2, u'Food': u'sushi'},...}

{u'ok': 1.0,
 u'result': [{u'City': u'albuquerque', u'Count': 31, u'Food': u'burger'},
             {u'City': u'albuquerque', u'Count': 16, u'Food': u'sandwich'},
             {u'City': u'albuquerque', u'Count': 6, u'Food': u'pizza'},...}

Ubiquitous McDonald's and Subway have a far reach.


In [ ]:
pipeline = [{"$match":{"amenity":"fast_food", "city_name":city}},
            {"$group":{"_id":{"City":"$city_name", "Name":"$name"},
                       "count":{"$sum":1}}},
            {"$project":{"_id":0, "City":"$_id.City",
                         "Name":"$_id.Name", "Count":"$count"}},
            {"$sort":{"Count":-1}},
            {"$limit":6}]

In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 8, u'Name': u"McDonald's"},
             {u'City': u'honolulu', u'Count': 4, u'Name': u'Subway'},
             {u'City': u'honolulu', u'Count': 3, u'Name': u'Burger King'},..}
 u'result': [{u'City': u'albuquerque', u'Count': 23, u'Name': u'Subway'},
             {u'City': u'albuquerque', u'Count': 12, u'Name': u"Blake's Lotaburger"},
             {u'City': u'albuquerque', u'Count': 11, u'Name': u"McDonald's"},...}

Auto repair shops are more popular in the expansive Southwest. This makes sense, since many more miles are put on automobiles in the much larger state of New Mexico.


In [ ]:
pipeline = [{"$match":{"shop":{"$exists":1}, "city_name":city}},
            {"$group":{"_id":{"City":"$city_name", "Shop":"$shop"},
                       "count":{"$sum":1}}},
            {"$project": {'_id':0, "City":"$_id.City",
                          "Shop":"$_id.Shop", "Count":"$count"}},
            {"$sort":{"Count":-1}},
            {"$limit":10}]

In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 50, u'Shop': u'supermarket'},
             {u'City': u'honolulu', u'Count': 24, u'Shop': u'convenience'},
             {u'City': u'honolulu', u'Count': 18, u'Shop': u'clothes'},...}
 u'result': [{u'City': u'albuquerque', u'Count': 66, u'Shop': u'convenience'},
             {u'City': u'albuquerque', u'Count': 46, u'Shop': u'supermarket'},
             {u'City': u'albuquerque', u'Count': 22, u'Shop': u'car_repair'},...}

The most popular convenience stores are regional.


In [ ]:
pipeline = [{"$match":{"shop":"convenience", "city_name":city}},
            {"$group":{"_id":{"City":"$city_name", "Name":"$name"},
                       "count":{"$sum":1}}},
            {"$project": {'_id':0, "City":"$_id.City",
                          "Name":"$_id.Name", "Count":"$count"}},
            {"$sort":{"Count":-1}},
            {"$limit":5}]

In [ ]:
u'result': [{u'City': u'honolulu', u'Count': 7, u'Name': u'ABC Store'},...}
u'result': [{u'City': u'albuquerque', u'Count': 14, u'Name': u'Circle K'}...}

**4. Conclusion**

When checking the output of the data as it was being cleaned, I found that a lot of the earlier data was already cleaned fairly well. This makes sense: since these are weekly extracts, older data has had time to be cleaned by other contributors, while the newest data has not.

It was also amazing to me how easily multiple cities could be aggregated, and how quickly new cleaning functions could be added into different sub-stages of cleaning.

It doesn't seem that it would be too hard to flesh out more robust cleaning algorithms, and run a script semi-regularly to clean the data automatically.

Also, I did not get a chance to use MongoDB's geospatial querying, but there is a fair amount of lat/lon data that could take advantage of it.
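For instance, with a 2dsphere index on each document's position field, a $near query could find amenities close to a point. A sketch of building such a filter (near_query and the 'pos' field name are assumptions about how the data would be indexed, not the project's schema):

```python
def near_query(lon, lat, meters, **filters):
    """Build a MongoDB $near filter for documents with a GeoJSON 'pos'
    field. Requires a 2dsphere index on that field, e.g.
    db.cities.create_index([("pos", "2dsphere")])."""
    query = dict(filters)
    query["pos"] = {"$near": {
        "$geometry": {"type": "Point", "coordinates": [lon, lat]},
        "$maxDistance": meters}}
    return query

# e.g. restaurants within 1 km of Waikiki:
# db.cities.find(near_query(-157.8294, 21.2793, 1000, amenity="restaurant"))
```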

These two cities are fairly small compared to more active cities in OpenStreetMap, so API pulls from other databases like Google Maps might better populate points of interest for the less active areas.


